Astronomy & Astrophysics manuscript no. point 


© ESO 2008 


February 5, 2008 





Ly-a forest: efficient unbiased estimation of 
second-order properties with missing data 

R. Vio 1 , V. D'Odorico 2 , H. Stoyan 3 , and D. Stoyan 3 

1 Chip Computers Consulting s.r.L, Viale Don L. Sturzo 82, S.Liberale di Marcon, 1-30020 Venice, Italy 
e-mail: robertovio@tin.it 

2 INAF - Osservatorio Astronomico di Trieste, via G.B. Tiepolo 11, 1-34143 Trieste, Italy 
e-mail: dodorico@oats . inaf . it 

3 Institute of Stochastics, TU Bergakademie Freiberg 
D-09596 Freiberg, Germany 

e-mail: stoyan@math . tu-freiberg . de 

Received ; accepted 

ABSTRACT 

Context. One important step in the statistical analysis of the Ly-a forest data is the study of their second order properties. Usually, this is 
accomplished by means of the two-point correlation function or, alternatively, the AT-function. In the computation of these functions it is 
necessary to take into account the presence of strong metal line complexes and strong Ly-a lines that can hidden part of the Ly-a forest and 
represent a non negligible source of bias. 

Aims. In this work, we show quantitatively what are the effects of the gaps introduced in the spectrum by the strong lines if they are not 
properly accounted for in the computation of the correlation properties. We propose a geometric method which is able to solve this problem 
and is computationally more efficient than the Monte Carlo (MC) technique that is typically adopted in Cosmology studies. The method is 
implemented in two different algorithms. The first one permits to obtain exact results, whereas the second one provides approximated results 
but is computationally very efficient. The proposed approach can be easily extended to deal with the case of two or more lists of lines that have 
to be analyzed at the same time. 

Methods. Numerical experiments are presented that illustrate the consequences to neglect the effects due to the strong lines and the excellent 
performances of the proposed approach. 

Results. The proposed method is able to remarkably improve the estimates of both the two-point correlation function and the A"-function. 
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1. Introduction 

The study of the absorption lines observed in the optical spectra 
of high redshift quasars has greatly evolved in the last decade 
thanks mainly to the advent of the new class of lOm-telescopes 
coupled with high resolution, high performance spectrographs 
and to the improved comprehension of the physical nature of 
the absorbers. 

The lines forming the so called "Ly-a forest" observed 
blue ward of the Ly-a emission of the quasar are due to 
the Ly-a transition in neutral hydrogen distributed along the 
line of sight. They originate in the fluctuations of the inter- 
mediate and low density intergalactic medium (IGM), aris- 
ing n aturally in the h i erarchical process of structure formation 
(e.g.JCen et al.lll994lzhang et al.lll995llHernauist et al.ll 1 9961: 
Mira lda-Escude et alJll996t iBi & Davidsenlll997t bave et al.1 
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199 71: IZhanget alJll997t iTheuns. Leonard & Efstafhioulll998 : 
Machacek et al Thus, the Ly-a forest provides a unique 

and powerful tool to study the distribution/evolution of bary- 
onic matter and the physical status of the IGM over a wide 
redshift interval. 

In analogy with the study of galaxies, the two-point correla- 
tion function has been classically used to st udy the distribution 
of the Ly-a lines (e.g. ISargent et al. 1980h . However, the Ly- 
a distribution is contaminated in the observed spectra by metal 
transitions associated with galactic haloes pierced by the line of 
sight. The stronger metal lines can extend for several angstrom 
covering weaker Ly-a lines and causing a bias in their distribu- 
tion. The same is true for strong Ly-a absorptions, in particular, 
Lyman limit and damped Ly-a systems. On the other hand, the 
distribution at small scales is biased by the intrinsic width of 
the lines (~ 25 - 30 km s" 1 ). 
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Observers have become more and more aware of those 
problems in particular for high resolution spectra and at large 
redshifts where the crowding of the lines increases (dn/dz °c 
+ z) 2 ' 5 , e.g., iKim et al.1 120021) . In the mos t recent papers 



(1 

based on high resolution quasar spe ctra (e.g.. iLu et al.lll996t 
Cristiani et al.ll 1997 ; Kim et al1l2001 ) the strong Ly-a lines are 
eliminated from the computation of the two-point correlation 
function and the lines closer than a minimum velocity separa- 
tion are merged. In order to correct for edge effects and pres- 
ence of the metal gaps in the line distribution, the observed 
distribution of line pairs is compared with Monte Carlo (MC) 
simulations of lines which are randomly distributed in the ob- 
served redshift interval other than the cosmological increase 
with redshift. The same gaps due to the cut-out absorptions are 
reported also on the simulated spectra and the minimum sepa- 
ration between two lines is preserved. The simulated process is 
approxim ately of Poisso n type over small redshift separations 
(see lSargentetalJll980h . 



A further approximation is introduced in this method be- 
cause it is not the redshift split but the velocity split distri- 
bution that is considered, where the separation Avy between 
two absorption lines of redshifts z,- and Zj is given by the non 
linear formula Avy = c|(z,- - Z/)|/[l + (z; + Z/)/2]. Since, if 
Avn > Avn, AV23, it is Avi3 + Avi2 + AV23, the quantities Av,-y 
do not represent Euclidean distances. Hence, from the quan- 
tities [Avij}f. =i it is not even possible to construct the one- 
dimensional point process corresponding to the sequence of 
the velocities vi, Vi, . . . , v^. The use of velocities at the place 
of comoving separations was suggested mainly by the un- 
known ratio between peculiar velocity and Hubble flow con- 
tributions to the measured Ly-c line redshifts. There is now ev- 
iden ce that peculiar velocities are negligible for those absorbers 
(e.g. jRauch et al.l2005l;[D'Odorico et al.l2006l) , which makes it 
preferable to use comoving distances to compute the two-point 
correlation function. 

The computation of a comoving distance corresponding to 
a given z requires the evaluation of an integral that in general is 
not available in closed form. Hence, a numerical approach has 
to be adopted. This can make the MC methods very time con- 
suming. For this reason, in this paper we present a geometric 
method to compute the two-point correlation that does not re- 
quire the generation of auxiliary random samples and therefore 
is much more efficient. 

In Sec.[2]the computation of the two-point correlation func- 
tion and of the /f-function is considered when only edge effects 
are present. In Sec. [3] and Appendices lAllDl the arguments are 
generalized to deal with the case of gaps in the data sequence 
and two algorithms are proposed. Two experimental data sets 
are analyzed in Sec. |U Finally, conclusions are presented in 
Sec.|6l 

Throughout this paper we will adopt for the cosmological 
parameters: Qo m = 0.26, £2oa = 0.74, and Hq = 73 km s _1 
Mpc -1 . 



2. Estimation of K(r) and £(r) in presence of edge 
effects 

A one-dimensional spatial point process X is any stochastic 
mechanism that generates a countable set of events x = {*;}"_, 
on the real axis R. It is called stationary if the distribution of X 
is invariant under translations, that is, the distribution of X + s 
is the same as that of X for any set. Usually, stationary point 
processes are described by their first-order and second-order 
characteristics. The first-order characteristic, say A, equals the 
mean number of points per unit of length and is often called 
intensity. The second-order characteristics are represented by 
the two-point correlation function %(r) and the pair correlation 
function g(r) that satisfy 



*(r)=l+£(r). 



(1) 



The function g(r) is defined as follows. Consider any infinitesi- 
mal interval B of length dL. The probability of having a point in 
B is AdL. Consider now two such intervals B\ and B^ of lengths 
dL\ and dLi and intercenter distance r. The probability to have 
a point in each interval can be denoted by P(r) and is expressed 



P(r) = g(r)A 2 dL x dL 2 . 



(2) 



The factor of proportionality g(r) is the pair correlation func- 
tion. It is clear that, in the case of complete randomness, 
g(r) = 1 or £(r) = 0, independently of r. g{r) > 1, or 
£(r) > 0, indicates that pairs of points with distance in the in- 
terval [r - dr/2, r + dr/2] are more likely to occur than for 
a Poisson process with the same intensity, that means there is 
clustering. On the contrary, g(r) < 1 or £(r) < correspond to 
inhibition or regularity. 

The estimation of g(r) and £(r) presents a practical diffi- 
culty. In fact, for a given distance r, the estimation of these 
characteristics requires evaluation of the number of point pairs 
in the infinitesimal interval [r-dr/2, r + dr/2]. Since this is im- 
possible, a finite interval [r-h,r + h] has to be used. This leads 
to approximation errors, especially for small r when it could 
happen that r — h < 0, and to the problem of choosing the band- 
width h. In particular, the determination of th e best value of 
h to u se for a specific problem is still an "art" {Martinez et al 



2005). The best recipe is to consider a series of bandwidths 
starting from h » \/A and then to compare the results; some- 
times it is recommendable to use "adapted" bandwidths, i.e. 
different values of h for small and large r. In order to avoid this 
kind of problem, sometimes statisticians use also the so-called 
TT-function, K(r), that is related to %(r) through 



K(r) 



or 



- 2 f 

Jo 



[1 +tj(u)]du, 



a , 1 dK(r) 



(3) 



(4) 



Here, no bandwidth has to be fixed. The main drawback of 
this approach is that K(r) is a cumulative function, which in- 
creases nearly linearly. (A comparable situation exists in clas- 
sical statistics for the pair "probability distribution function" 
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F(x) and the "probability density function" f(x).) For this rea- 
son, a plot of K(r) is more difficult to interpret than a plot 
of %(r). As a consequence, £(r) is more useful in exploratory 
statistics, whereas K(r) is useful in testing statistical hypothe- 
ses, e.g., that a given point sequence is "completely random" 
or belongs to a Poisson process. 

If a point process was observable over the entire K domain, 
the estimation of K(r) and £(r) could simply follow their def- 
initions. But in general, a point process can be observed only 
in a finite region W defined in the interval [xmi n , x max ] c R, the 
observing window. Then edge effects play an important role, 
since the estimation of K(r) and £(r) should require the use of 
points external to W. Hence, the "natural" estimator 



I Wl 

R(r) = y z z 7 ^-^D' 

with \ W\ the length of W and 



7 r («) = 



1 if u < r 
otherwise, 



(5) 



(6) 



will yield results too small, with a negative bias. The same 
holds for the "natural" estimator of £(r) 



\W\ 2 " " 



^ = ^Z Z k h (r-\ Xi - Xj \)-l, 



where 



k h (u) = 



l/2h if -h<u<h 
otherwise. 



(7) 



(8) 



Figures Q]|2] show that the mentioned biases are really sig- 
nificant also in the one-dimensional case considered here. As 
expected, the larger r the larger the biases. This is due to the 
increasing number of points external to W that, for increasing 
values of r, should be necessary to estimate both K(r) and £(r). 
Although not clearly visible in Fig. [2] %(r) presents a bias also 
when r < h. This is because in the computation of £(r) only the 
points in the interval [0, r + h] of length shorter than 2h can be 
considered. 

In Astronomy, especially in the field of the statisti- 
cal analysis of the systems of galaxies, this edge-correction 
problem is well known. Various algorithms, mostly of 



dimensional case (e.g., see 


Hamilton 


1993; 


Landv & Szalav 


19931: lOuashnock & Sein 


1999; Pons-Borderfa et al. 1999; 


Kerscher II 19991 iKerscher et al. 112000: Martinez & Saar 2002. 



and references therein). Their conversion to the one- 
dimensional case is straightforward. However, the fact to 
work in the o ne-dimensional cas e allow s us to follow the ap- 
proach due to IStoyan & Stoyan (2000) which is a modifica- 
tion of the MC metho ds presented in Ha milton (1993) and 
Landy & Szalav Idl993l) . In practice, the auxiliary random sam- 
ples of points used in those papers are replaced by geometrical 
terms that result from exact integration. This permits to obtain 
more precise results since the quantities of interest are calcu- 
lated and not only estimated as in the case of the MC tech- 
niques. Moreover, there is a remarkable benefit in the compu- 
tational efficiency. In fact, in order to provide satisfactory re- 
sults, the MC methods require the use of a number of auxiliary 
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Fig. 1. Comparison of the biased estimator K(r) with the unbi- 
ased one K(r) with adapted intensity Av(r). The data used in the 
simulation consist of 200 points drawn from a Poisson process 
in the interval [0, 1 ] . 

Twe-poinl cdrrelalion functien 
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Fig. 2. Comparison of the biased estimator g(r) with the unbi- 
ased one ~g(r) with adapted intensity As(r). The data used in the 
simulation are the same of Fig.[T] h = 0.05. 



random samples much larger than the numb er of experimental 



point s with obvious computational costs (see Mar tinez & Saar 
2002). 



We start from the unbiased rigid motion corrected estima- 
tors of A 2 K(r) and A 2 £(r): 



A 2 K(r) = \W\J] J] ppAHxt - xj\), 

where the weights p,j are given by 

\WnW Xi - Xj \ 



(9) 



Pa = 



\w\ 



(10) 
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W s is the observing window shifted by the amount s. If W is an 
interval, it is simply \W n W Xi - x .\ = \W\ - |jc; - Xy|. Similarly: 



XI XI Pij k h{ r -\ x i-Xj\)-A 2 



(ID 



Here, c(r) = 2/i/(r + /z - max[0, r - h]) is a cor recting facto r 
for the values of r for which r - h goes negative ( iGuan 2006). 
A "naive" method used by many authors to obtain estimates of 
K(r) and £(r) is to set A 2 = n 2 /\W\ 2 and then to divide the esti- 
mators © and (fTTT i by this quantity. However, the circumstance 
that n/\W\ is a good estimator of A does not necessarily im- 
plies that the same ho lds for n 2 /\W\ 2 wit h res pect to A 2 . In fact, 



more is possible. As Hamilton ( 1993b and iLandv & Szalav I 



it is possible to reduce the mean squared error (MSE) Q if, in- 
stead of the naive classical estimator, adapted estimators for A 
are used. One particularity of these estimators is that they are 
different for K(r) and £(r) and depend on the distance r. They 
are denoted by As(r), for £(r), and Ay(r), for K(r). The index 
"S " refers to surface an d "V" to volume (for more details, see 
Stovan & Stovan 2000). The expression for the two estimators 
is given by 



and 



lw(xj -r) + IwpCj + r) 
2\WC\W r \ 



, , ^ v \Wn[xi-r,Xi + r]\ 
2f \wnw t \dt 



with 



1 ifxeW, 
otherwise. 



(12) 



(13) 



(14) 



Figures Q][2] show the remarkable improvement in the estimate 
of both K(r) and f(r). 

3. Estimation of K{r) and f (r) in presence of gaps 

In practical applications, the edge effects are not the only prob- 
lem. In fact, often the observing window is a union of intervals 
separated by gaps. In the spectra of quasars this is due to the 
presence of metal lines and strong Ly-a systems that can hid- 
den portions of the Ly-a forest. Ignoring these gaps and tak- 
ing erroneously the interval starting with the most left interval 
start-point to the most right interval endpoint as the observ- 
ing window produces biased estimates. Clearly, the estimators 
will tend to report more clustering than really existing. The rea- 
son is simply that when the points within some given regions 
are removed, then this artificially creates separated groups of 
points that mimic the presence of clusters. As a consequence, 
an erroneous indication of clustering can happen even in the 
case of no-interaction or complete spatial randomness in the 
point pattern. On the other hand, if clustering or aggregation is 



1 The mean squared error of an estimator £(x) of £(r) is given by 
MSEj(r) = E[(J(r) - <f(r)) 2 ]. A similar expression holds for MSE^(r). 



really present, a higher degree of clustering than true may be 
obtained. These effects are clearly visible in Figs. [30 where 
the performance of the estimator K(r) is shown for a Poisson 
process, typically of complete random spatial patterns, and a 
Cox one tha t models spatial patterns with significant density 
fluctuations (IMoller & Waagepetersen 1 l2004l) . 

In presence of gaps the unbiased estimators K(r) and £(r) 
and the intensities Ay(r) and As (r) can still be used. Here, how- 
ever, a computational problem arises. In fact, although the com- 
putation of the quantities \W n Wx t - Xj \, A$(r) and Ay{r) needs 
only elementary mathematics, nevertheless it is impossible to 
give explicit formulas. For small r, a simple approximation is 
given by 



d 19931) showed for £(r) and lStoyan & Stovan I d2000h for KM. \WnW r 



= L-(n w + l)r, 



(15) 



where L is the distance between the most left observation inter- 
val endpoint and the most right observation interval endpoint 
and n w is the number of gaps. For larger r, the exact algorithm 
presented in Appendix[A]can be used. As an alternative, the ap- 
proximated but efficient algorithm based on a Fourier technique 
in Appendix[B]is proposed. After that, it is possible to compute 
As(V) and then ^(r). The computation of Ay{r) is more compli- 
cated and can be carried out by numerical integration. However 
a more efficient approach, based again on a Fourier technique, 
is described in Appendices iBllCl 

The excellent performance of the estimator K(r) combined 
with the intensity Ay(r) is shown in Figs.[5][6](to compare with 
Figs. [301. Figures. [7j|8] show the results obtainable with £(r) 
combined with As(r). The number of points used in this sim- 
ulations is of the same order of that typically expected for the 
Ly-a lines in a single spectrum. In order to check if these results 
are consistent with what expected from the theory, in Fig.[9]the 
theoretical MSEf(r) of the £(r) estimator combined with As{r) 
for the experiment with the Poisson process used for Fig. [7] is 
compared with that estimated on the basis of 5000 simulations; 
apart from the very small values of r, where the effects due to 
the bandwidth h are more important, the agreement is reason- 
ably good. Worth of note is the non-monotonic behavior of the 
two MSEs that is directly linked to the fact that the observing 
window W is constituted by disjoint segments. The equation 
used for MSE f (r), 



MSE,(r) = y + 1 , 
* KJ 2hA 2 \WC\W r \ 

is due to lStovan et 1071(119931) . 



(16) 



4. A practical application 

In order to apply the discussed method and verify its reli- 
ability, we have used the best high resolution high signal- 
to-noise quasar spec tra publicly avail able observed with the 
UVES spectrograph dDekker et al.ll2000h at the VLT telescope 
in the context of the ESQ Large Program me "The Cosmic 
Evolution of the IGM" (Ber geron et al ]|2004 . In particular, we 
have chosen two examples of Ly-a forest contaminated by a 
large number of metal transitions: PKS0237-23 at z em = 2.233 
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and PKS2126-158 at z em = 3.267, for which the line blanket- 
ing due to the same Ly-a lines is more severe as the emission 
redshift is larger. 

The quasar spectra have a resolution R 45000 and a 
signal-to-noise ratio in the Ly-a forests S /N - 80 per pixel. 
They have be en reduced with the pipeline of the instrument 
(version 2.1, Ball ester et al. 2000) provided by ESO in the 
context of the data reduction package MIDAS. The continuum 
level was determined with a manual subjective method based 
on the selection of the regions free from clear absorption that 
are successively fitted with a spline polynomial of 3rd degree. 
The normalized spectra are then inspected to identify metal ab- 
sorption systems with a particular attention to the transitions 
that fall in the Ly-a forest. Once the forest has been cleaned, 
the H i lines are fitted with Voigt profiles in the LYMAN context 
of the MIDAS reduction package. In order to increase the relia- 
bility of the fit parameters for the saturated Ly-a lines we used 
also the other unblended lines of the Lyman series falling in the 
observed spectral range. We excluded 1000 km s _1 from the 
Ly-p emission and 5000 km s _1 from the Ly-a emission of the 
quasar to avoid associated absorbers and the proximity effect 
due to the quasar itself. Furthermore, we cut out of the spectra 
the regions occupied by the metal transitions and the Lyman 
limit systems (Ly-a lines with column density 7V(HI) > 10 17 
cm -2 ) which could mask the presence of weaker lines in their 
velocity profiles (line blanketing). At the smallest scales the 
clustering properties are affected by the typical velocity width 
of the lines (~ 25 - 30 km s"\ e.g. Kim et al. 2001), for this 
reason we have merged the line pairs closer than 25 km s _I . 

Figures [T0lfT3l show K(Ar) and £(Ar) as a function of co- 
moving spatial separation, Ar, for the Ly-a forest data of 
the objects PKS0237-23 and PKS2126-158. For PKS0237- 
23 the number of Ly-a lines is 191 and the total length of 
gaps amounts to about 16% of the interval spanned by W, 
whereas for PKS2126-158 these quantities are 300 and 20%, 
respectively. For reference, the corresponding functions edge- 
corrected but evaluated without considering the gaps are also 
displayed. In both the cases, it is evident the stronger indication 
of clustering if the gaps are not taken into account. For the same 
experiment, Figs. [T4"l[T5l show the comparison between £(Ar) 
and the estimate of the t wo-point correlation fun ction obtained 
with the MC method by lLandy & Szalay (1993). In this last 
case a number of No = 50 data sets have been generated that 
contain a number of random samples equal to that of the origi- 
nal ones. Nd has been chosen large enough that the mean of the 
estimated two-point correlation functions does not change sig- 
nificantly when more data sets are added. The good agreement 
is evident. However, if the computation of £(Ar) takes a total 
of 1-2 seconds of CPU timeU, the MC method requires a simi- 
lar amount of time for each random sample. Now, if one takes 
into account that usually the bandwidth h has to be determined 
by a trail and error approach, also with the limited size of the 
data sets here used, the computational advantage of the method 
presented in this paper is obvious. This is especially true if two 



or more lists of lines have to be analyzed at the same time (see 
below). 

5. Multi-list analysis 

Here, we briefly illustrate an additional benefit in using the es- 
timator %(r). In particular, the fact that it can be easily gen- 
eralized for the estimation of a two-point correlation function 
when two or more lists of lines are available. The main problem 
is that, in an experimental context, the lists can have different 
characteristics concerning the covered spatial intervals and/or 
the total extension of the gaps. Hence, the naive estimator 



Sir) 



(17) 



with ^i(r) the estimate obtained from the Zth data set, can 
provide unsatisfactory results since all the lists have the 
same importance in forming the final result. Some kind of 
weighting is necessary. In this respect, it has been proved 
(Ohser & Muecklich 2000) that the estimator 



S(r) = 



Zf=iiwrw f |,fl(r) 
Ef = i \w n WA, 



(18) 



provides an estimate of £(r) with a smaller variance with re- 
spect to H(r) for each value of r. With the numerical techniques 
described in Sects. iBllDl the implementation of the algorithm 
for the computation of H(r) is trivial. However, the character- 
istics and the performances of this estimator are beyond the 
scope of the present paper. 

6. Conclusions 

In this work we have analyzed possible sources of bias in the 
estimation of the second-order characteristics of the Ly-a for- 
est. In particular, the metallic and the strong Ly-a lines have 
been considered that can hidden part of the lines of interest. 
This may result in a severe bias of both the Zf-function and 
the two-point correlation function. Specifically, an erroneous 
indication of clustering can happen even in the case of no- 
interaction or complete spatial randomness in the line pattern 
or, if a clustering is really present, a higher degree of cluster- 
ing than true can be obtained. In order to handle this problem, 
we have proposed a computational efficient method that has 
been implemented in two different algorithms; one that is able 
to provide exact results and the other one that is approximated 
but very fast. Its excellent performances have been validated 
through numerical simulations and an application to real data. 
We have also presented an extension of the method for the esti- 
mation of the two-point correlation function in the case of two 
or more lists of lines that have to be analyzed at the same time. 



2 Experiments have been conducted with Matlab 6 on a Pentium IV 
- 1500 GHz processor in a Windows 2000 operating system. 
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- Biased 
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Fig. 3. Mean and 68% (i.e., 1-cr) confidence interval for the 
biased estimator K(r), with periodic boundary conditions, ob- 
tained from 100 realizations of a Poisson process with gaps. 
The data sets has been simulated by drawing 200 points from 
a Poisson process in the interval [0, 1] and discharging those 
that are in subset [0.20,0.25] U [0.30,0.35] U [0.50,0.55] U 
[0.70, 0.75] U [0.80, 0.85]. For reference also the ^-function of 
the Poisson process is shown ("true" line). Because of the im- 
posed period boundary conditions, estimator K(r) is free from 
the edge effects. 



K-function / biased estimator 




Fig. 5. As in Fig. |3]but for the unbiased estimator K(r) and the 
adapted intensity Ay(r). 
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Fig. 6. As in Fig. |4]but for the unbiased estimator K(r) and the 
adapted intensity Ay(r). 
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Fig. 4. The same as in Fig.[3]but for a Cox process simulated 
in the interval [0, 1] by taking the absolute value of a station- 
ary Gaussian random field with a Gaussian correlation func- 
tion whose dispersion is set to 0.01. The Gaussian random 
fields have been generated under the hypothesis of periodic 
boundary conditions. Because of this, the line labeled "True", 
that provides the mean value of the estimator K(r) with peri- 
odic boundary conditions, is free from the edge effects. This 
mean has been computed on the basis of 5000 simulations with 
W = [0,1]. For reference, also the ^-function of a Poisson 
process is shown. 
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Fig. 7. As in Fig. rjbut for the unbiased estimator f (r) and the pig g Estimated MSEf (r) (calcu i ated for eleven values of 



adapted intensity A s (r). h = 0.01 



Two-point correlation function 




r) 

vs. the theoretical one for the experiment in Fig. [7] Notice the 
non-monotonic behavior of the two functions that is directly 
linked to the fact that the observing window W is constituted 
by disjoint segments. 



0.01 0.02 0.03 0.04 0.05 0.06 0.07 



Fig. 8. As in Fig.|4]but for the unbiased estimator £(r) and the 
adapted intensity A s (r). h = 0.005. 
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Unbiased 
Biased 



Ar (comoving Mpc) 

Fig. 10. K(Ar) for the data of PKS0237-23 when the gaps due 
to the metal and the strong Ly-a lines are taken into account 
(unbiased estimator) as a function of the comoving spatial sep- 
aration, Ar. The largest value of Ar in the figure corresponds to 
about 5% the interval spanned by this quantity. The total exten- 
sion of gaps amount to about 16% of spatial interval covered 
by W. For reference, the same estimate is shown when gaps are 
neglected (biased estimator). 




3 4 5 

Ar (comoving Mpc} 



Fig. 12. K(Ar) for the data of PKS2126-158 when the gaps due 
to the metal and the strong Ly-a lines are taken into account 
(unbiased estimator)as a function of the comoving spatial sep- 
aration, Ar. The largest value of Ar in the figure corresponds to 
about 5% the interval spanned by this quantity. The total exten- 
sion of gaps amount to about 20% of spatial interval covered 
by W. For reference, the same estimator is shown when gaps 
are neglected (biased estimator). 



Unbiased 
Biased 




Ar (comoving Mpc) 

Fig. 11. f(Ar) for the data of PKS0237-23 when the gaps due 
to the metal and the strong Ly-a lines are taken into account 
(unbiased estimator) as a function of the comoving spatial sep- 
aration, Ar. The largest value of Ar in the figure corresponds to 
about 5% the interval spanned by this quantity. The total exten- 
sion of gaps amount to about 16% of spatial interval covered 
by W and h = 0.001 in the same units. For reference, the same 
estimate is shown when gaps are neglected (biased estimator). 




Ar (comoving Mpc) 

Fig. 13.£(Ar) for the data of PKS2126-158 when the gaps due 
to the metal and the strong Ly-a- lines are taken into account 
(unbiased estimator) as a function of the comoving spatial sep- 
aration, Ar. The total extension of gaps amount to about 20% 
of spatial interval covered by W and h = 0.001 in the same 
units. For reference, the same estimator is shown when gaps 
are neglected (biased estimator). 
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Ar (comoving Mpc) 

Fig. 14. J(Ar) for the data of PKS0237-23 when the gaps due 
to the metal and the strong Ly-a lines are taken into account 
(unbiased estimator) as a function of the comoving spatial sep- 
aration, Ar. The largest value of Ar in the figure corresponds to 
about 5% the interval spanned by this quantity. The total exten- 
sion of gaps amount to about 16% of spatial interval covered 
by W and /; = 0.001 in the sam e units. For reference, th e result 



provided by the MC method bv lLandv & Szalav I d 19931) is also 
shown (unbiased MC estimator). 



a 




Ar (comoving Mpc) 

Fig. 15. J(Ar) for the data of PKS2126-158 when the gaps due 
to the metal and the strong Ly-a lines are taken into account 
(unbiased estimator) as a function of the comoving spatial sep- 
aration, Ar. The largest value of Ar in the figure corresponds to 
about 5% the interval spanned by this quantity. The total exten- 
sion of gaps amount to about 16% of spatial interval covered 
by W and h = 0.001 in the same units. For reference, th e result 
provided by the MC method bv lLandv & Szalav 1 (1 19931) is also 
shown (unbiased MC estimator). 
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< r >- 



a„ b„ 



|W| =0.8 



|W | = 0.65 



|WnW| = 0.45 



end 

il = il + 

zi = bin 

z2 = a[i2 

end 

else if zl <y2 then 

if zl > yl 
s 

else 

s 

end 



i i 

| «- OBSERVING INTERVAL -» \ 

Fig.A.l. Example of an observing window W with two gaps, 
its shifted version W r and the intersection WO W r . The window 
lengths are given in units of the observing interval x max - Xmi n . 
r=0.15 in the same units. 

Appendix A: An exact method to compute \W n W r \ 

The following presents a pseudo-code that implements an exact 
algorithm for the computation of \W n W r \ for a window W 
with gaps. It is assumed that the observing window W lies in 
the interval [x m j n , x max ] and has gaps [at, bk] for k — 1, m 
with 

Xmin = bo < a\ < b\ < ... < a„ k < b„ t < a„ l+ \ = x max . (A.l) 

Fig. lA.ll shows the situation for the case = 2. 

The algorithm considers all combinations of the intervals 
between the gaps, determines the lengths of their intersections 
and sums up. The wanted length \ W n W r \ is denoted by s. 
% input: 

% r — > interpoint distance 

% 

% initialization of indices: 
s = 0; 

11 = 1; 

12 = 1; 

% fix the shifted interval: 
yl = b[il - 1] + r; 
y2 = a[il] + r; 

if y2 > x max theny2 = x m . dx ; end 
% fix the original interval: 
zl = b[il - 1]; 
z2 = a[i2]; 

% compute the length of the intersections 
while (il < n/c + 1) and (yl < x max ) do 
if z2 < y2 then 

if z2 > yl then 

if zl > yl then 

s - s + z2 - zl; 

else 



= s +z2 - yl; 
• 1; 



1 then 

= s +y2 - zl; 

= s +y2 - yl; 

il = il + 1; 

yl = b[il - l]+r; 

if /l < «a + 1 then 

y2 = a[il] + r; 

if y2 > x max then 

— -^max? 

end 

end 

end 

end 

end 

return s 

Appendix B: An approximate fast method to 
compute \W n W r \ 

The method presented in the previous section provides an exact 
value of \W n W r \. If the computing time is of concern, a more 
efficient, although approximated method, is possible. The idea 
is to consider the observing window W as a function 'Wix) in R 
defined in the domain (— oo, +oo). When x 6 [x m i n ,x max ], then 
*W(x) = if x belongs to a gap, 'Wix) = 1, otherwise. If 
x £ [x min , x max ], it is *W(jc) = 0. 

It is not difficult to realize that, for a fixed r* , y(r*) = \W n 
W r ' | is given by 

f W(x)'W(x + r*)dx. (B.l) 

oo 

When r* is made to change in the interval (-oo, +oo), this equa- 
tion provides the autocorrelation function of < W(x). Hence, 

y(r) = IFT {FT^x)] x (FT[«W(jc)])'} . (B.2) 

The symbol " ' " means complex conjugation, whereas FT and 
IFT denote the Fourier transform and the inverse Fourier trans- 
form, respectively. 

An efficient implementation of Eq. (IB.21 > requires the inter- 
val [Xmin, x max ] be discretized in a set of Nb bins. In the case 
of the estimator £(r), this does not represent a true limit since 
it is a function that is computed on spatial bins of length 2h 
(see Eq. (fTTTi). The value of Nb to use in the discretization de- 
pends on factors as the extension of the gaps, the number of 
points and the range of distance r of interest (see also below). 
In any case, this is not a critical quantity. In the numerical ex- 
periments presented in the previous sections, Nb = 5000 is suf- 
ficient to obtain results almost indistinguishable from those ob- 
tainable with an exact approach. However, even values of Nb 
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much larger than this one (e.g., 10 5 ) determine only a limited 
increase of the computational burden. 

If w is an array of length 2Nb, with its central Nb elements 
containing function 'W(x) in the interval x € [x n 
pled on a discrete grid, then 



it Xn 



7 * IDFT {DFT[h'] (DFT[w])'} X Ax. 



(B.3) 



Symbol "O" indicates the point-wise multiplication, Ax is the 
sampling step and DFT and IDFT denote the fast discrete 
Fourier transform and the fast inverse discrete Fourier trans- 
form, respectively. The computational benefit of this approach 
is evident since the first Nb elements of y contain the values of 
y(r) for < r < (x max - Xmin). The fact of using an array w 
of length 2N/, corresponds to a zero-padding operation to avoid 
the aliasing effects typical of the discrete Fourier transform. 
Array y can be used to compute the weights [py] in Eq. dTTb . 

Appendix C: An approximate fast method to 
compute A s (r) and A v {r) 

As shown by Eqs. d9b-(fm>. the estimators K(r) and £(r) re- 
quire the intensities Ay(r) and A s (r). Again, the computation 
can be done with an exact algorithm as in Appendix lAl As an 
alternative, an approximative but very efficient Fourier based 
algorithm is possible that, similarly to Appendix iBl requires 
the discretization of the interval lx m in, x max ] in a set of Nb bins. 

Once y has been calculated, the computation of As (r) is a 
trivial operation. In fact, it is sufficient to check whether the 
elements of the array w that correspond to x, - r and x,- + r 
are equal to or to 1 . The situation is more complex for Ay(r) 
because of the terms 0(r) = Yi"=i W n \- x i ~ r > x i + r ]\ an d e ( r ) 

2 j!\wnw,\dt. 

For 6(r), the basic idea is to identify the term [x; - r, X; + r] 
as a function wj'(x) such as w'.(x) = 1 for x e [x,- - r, x; + r], 
w r Ax) = otherwise. The reason is that in this way w[(x) can 
be expressed in the convolution form 



X+oo 
IA 
DO 



|jc|)£(x - Xi)dx 7 



(CI) 



where 6(x) is the classic impulse function. Then, for a given r* 
it is possible to write 6{r*) in the form 



X+oo n 
«W(*)^wf (x)dx. 
CO ■ , 



Now, a discrete approximation of 0(r*) is provided by 



(C2) 



0(r*) * SUM [w O IDFT {DFT[/ f .] O DFT[x]}] x Ax. (C.3) 

SUM[z] is the operator that sums up all the elements of the ar- 
ray z, x is an Nb + No array that in its central Nb elements con- 
tains the number of points in each of the bins of W, whereas 
I r is a Nb + No array containing the discretized version of the 
function 7 r [.] stored in wraparound order with No providing the 
number of zero-pads necessary to avoid the aliasing effect due 
to the DFT. Here, also the array w has size Nb + No- The term 
DFT[/ r .] has not necessarily to be calculated numerically since 
an analytical expression is available (e.g., see lBriggs & Henson 



1996). In principle, Ni, should be chosen large enough that each 
entry of the array x is either or 1 . However, from our numer- 
ical experiments its comes out that a value of Nb such as 5-10 
bins correspond to the bandwidth h is sufficient to obtain satis- 
factory results. 

For a given r*, an approximation of e(r*) can be obtained 
from the array y by means of 



e(r') « 2 x SUM[y(l : i r ,)] x Ax. 



(C4) 



(1 : /, •) indicates the indices of the elements of the array to sum 
up and i r is the index of the bin corresponding to the distance 



Appendix D: An approximated fast method to 
compute A 2 %(r) and A 2 K{r) 

In the previous appendices some efficient numerical methods 
have been presented to compute \W n W r \ and As(r). These 
are some of the ingredients for the computation of the func- 
tion ^(r) that, however, requires also the total number, Ni,{r), of 
pairs with a distance in the range [r - h,r + h\. The simplest 
method to obtain this last quantity consists in counting. In the 
one-dimensional case, especially for limited data sets and/or 
small r, this represents the most efficient approach (we have 
used it in our numerical experiments). In the case of large data 
sets, an alternative method based on a Fourier approach is pos- 
sible, that permits to obtain Nhif) without the ne cessity to com 



pute the distance between the points (see also ISzapudi et al. 
20051) . 



The basic idea is that two points X; and Xj (x; < xj) have a 
distance in the range [r — h,r + h] if Xj e [x; + r - h, x, + r + h]. 
Now, this condition can be expressed in the equivalent form 



2h I kh(x - x, - r)5(x — xj)dx = 1. 

xj —CO 



(D.l) 



A similar expression holds when x, > xj. Hence, if r > h, Nh(r) 
is given 



N h (r) 



— 2h y y I kh(\x - Xj| - r)6(x — xf)dx. (D.2) 

■ . • . J-oo 



'=1 /=1 



When r < h, this number has to be subtracted by the number 
n of points (to avoid self-pairing). If discretized, for a fixed r* , 
this equation takes the form, 

N h (r*) « SUM [x O IDFT {DFT[A£] O DFT[x]}] , (D.3) 

where k 1 '^ is the discretized form of the function 2/ifc/,(x - r*) 
stored in wraparound order. At this point, the computation of 
£(r) should require the calculation of the weights {pij} that de- 
pend on the quantities {\W n W^-*,!}. In their turn, these quan- 
tities depend on the distances {|x,- - xj\] that are not available. 
However, if r » h (a typical situation of interest), from Eq. ( fTTT i 
it is 



\Wnw x . 



:h(\Xi ■ 



r) 



\W\ 



\wnw r 



k h (\Xi ■ 



r). (D.4) 
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With this approximation the quantity A 2 £(r) is given by 

\W\ 2 c(rl Ar . 
A t(r ) = Nh(r ). 

%v ; 2\wnw r .\ v ; 



(D.5) 



The denominator of this expression can be obtained from y. 

In Eq. dD.4t , the term DFT[^ ] can be computed with- 
out actually performing the DFT operation if the array k? = 
DFT[£"] is available. In fact, it is 



„r* _ „0 „ -i2jti>[0,1 iV-l]/JV 

K h - K h w e ' 



(D.6) 



with i = V-T, i r the index of the bin corresponding to the 
distance r, and N the length of the array k9. 

With arguments similar to those presented above, it is pos- 
sible to show that, if N(r*) is the total number of pairs with 
distance less or equal to a fixed r*, then 



N(r*) * SUM [x O IDFT { DFT[/,.. ] O DFT [x]}] 
and 

\W\ 2 



A K(r ) 



\wnw r 



■N(r*). 



(D.7) 



(D.8) 
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